Welcome to this one-day workshop: A crash course on using machine learning methods effectively in practice
Some materials and illustrations are based on Chapters 2 and 3 of The Mathematical Engineering of Deep Learning
Data are available at https://github.com/benoit-liquet/SSA-MAPS-ML
Tom Mitchell (1998): ``A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.’’
Stephen Boyd (2016): ``machine learning: extract information directly from historical data and extrapolate to achieve a task (e.g. make predictions)’’
Consider the data from Breast Cancer Wisconsin (Diagnostic) (WBCD) dataset
Goal is to predict whether a lump of a breast mass is benign (Y = 0) or malignant (Y = 1).
Historical data consists of a large number of patient records
30 (=\(d\)) characteristics of individual cells of breast cancer
Use a machine learning algorithm that observes these data, produces a predictor
Predictor takes as input 30 values, returns a single Boolean prediction (0 or 1)
This is a classifier, since we are predicting an outcome that takes only two values
Evaluate model by its error rate on a separate test set of data, not used to develop the model
A probabilistic model returns a probability that the patient has a malignant lump, not just a Boolean
Sigmoid model is a logistic model (more details later)
It gives a probability as output:
\[\begin{equation} \label{eq:first-shallow-view} \hat{y}=\underbrace{\sigma\left(\overbrace{b+w^\top x}^{z}\right)}_{a}. \end{equation}\]
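As a minimal sketch (with made-up values for \(b\), \(w\) and \(x\)), the sigmoid predictor can be written directly in R:

```r
# Sigmoid (logistic) function: maps z to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

b <- -0.5                  # intercept (hypothetical value)
w <- c(0.8, -1.2, 0.3)     # weight vector (hypothetical values)
x <- c(1.0, 0.5, 2.0)      # one input with d = 3 features

z <- b + sum(w * x)        # z = b + w'x
y_hat <- sigmoid(z)        # a = sigma(z)
y_hat                      # estimated probability that Y = 1
```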
You can turn it into a classifier:
\[\begin{equation} \label{eq:hyper-plane-based-classifier} \widehat{\cal Y} = \begin{cases} 0~ (\texttt{Benign}), & \text{if} \quad \hat{y} \le 0.5, \\ 1~ (\texttt{Malignant}), & \text{if} \quad \hat{y} > 0.5 \end{cases} \end{equation}\]
We will run the sigmoid model using generalized linear model framework
Keep in mind sigmoid model is a logistic model
Let’s run a model with one feature and a model with 10 features
dim(data)
[1] 569  32
data <- data %>% mutate(diagnosis_0_1 = ifelse(diagnosis == "M", 1, 0))
var <- c(1,2,5,6,9,10,11,12,15,16,25,31)
data <- data[,-c(1,2)]
data <- data[,var]
colnames(data)
 [1] "radius_mean"            "texture_mean"           "smoothness_mean"
 [4] "compactness_mean"       "symmetry_mean"          "fractal_dimension_mean"
 [7] "radius_se"              "texture_se"             "smoothness_se"
[10] "compactness_se"         "smoothness_worst"       "diagnosis_0_1"
set.seed(101)
training <- createDataPartition(data$diagnosis_0_1, p=0.8, list=FALSE)
train <- data[ training, ] #--> training data
test <- data[ -training, ] #--> test data
trainfull <- train[,-c(1,2)]
uni.model <- glm(diagnosis_0_1~smoothness_worst,family=binomial,data=trainfull)
full.model <- glm(diagnosis_0_1~.,family=binomial,data=trainfull)
Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred
(The warning signals that some observations are fitted with probability essentially 0 or 1, i.e. the classes are nearly separable on these features.)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.341524 3.9417679 1.608802 1.076597e-01
smoothness_mean -205.038026 51.1673877 -4.007201 6.144247e-05
compactness_mean 107.058369 19.8272716 5.399551 6.680780e-08
symmetry_mean 20.622891 14.0976782 1.462857 1.435064e-01
fractal_dimension_mean -478.435425 96.7456145 -4.945293 7.602936e-07
radius_se 18.153664 3.2350594 5.611539 2.005345e-08
texture_se 1.501148 0.7130725 2.105183 3.527536e-02
smoothness_se -741.412778 179.0867319 -4.139965 3.473590e-05
compactness_se -79.855639 27.4671773 -2.907311 3.645500e-03
smoothness_worst 195.309199 33.1556122 5.890683 3.846028e-09
Predicted probabilities for ten patients (named by row index):
         29          35          36          41          45          48          61
0.999963726 0.990044782 0.999992477 0.033481026 0.868105685 0.921966302 0.002620119
         63          66          68
0.999929529 0.984120111 0.004692286
my.classifier <- function(model,patient,threshold=0.5) ifelse(predict(model, patient, type="response")>threshold,1,0)
my.classifier(full.model,new.patients[1:10,])

\[\begin{equation} \label{eq:acc-def} \text{Accuracy} = \frac{\text{TP} +\text{TN}}{n_{\text{test}}}, \end{equation}\]
which can be misleading for unbalanced data (especially with rare events)
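As a check, the accuracy formula can be applied directly to the confusion-matrix counts of the full model reported below:

```r
# Accuracy from confusion-matrix counts (TP, TN, FP, FN taken from the
# full-model confusion matrix shown below)
TP <- 35; TN <- 70; FP <- 3; FN <- 5
n_test <- TP + TN + FP + FN
(TP + TN) / n_test        # 105 / 113

# Why accuracy can mislead on unbalanced data: with 5% positives, the
# trivial "always predict 0" classifier already reaches 95% accuracy
```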
pred.full <- my.classifier(full.model,test)
res.full <- confusionMatrix(as.factor(pred.full),as.factor(test$diagnosis_0_1),positive="1")
table(pred.full,test$diagnosis_0_1)
pred.full 0 1
0 70 5
1 3 35
res.full$table
          Reference
Prediction  0  1
         0 70  5
         1  3 35
Accuracy
0.9292035
Sensitivity Specificity Precision Recall
0.8750000 0.9589041 0.9210526 0.8750000
pred.uni <- my.classifier(uni.model,test)
res.uni <- confusionMatrix(as.factor(pred.uni),as.factor(test$diagnosis_0_1),positive="1")
res.uni$table
          Reference
Prediction 0 1
0 66 24
1 7 16
Accuracy
0.7256637
Sensitivity Specificity Precision Recall
0.4000000 0.9041096 0.6956522 0.4000000
A popular way to average precision and recall is by considering the harmonic mean of the two values.
This is called the F\(_1\) score :
\[\begin{equation} \label{eq:f1-score} F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = 2 \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \end{equation}\]
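For instance, plugging in the precision and recall of the full model from the results above:

```r
# F1 score as the harmonic mean of precision and recall
# (values taken from the full-model results above)
precision <- 0.9210526
recall    <- 0.8750000
F1 <- 2 * precision * recall / (precision + recall)
F1
```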
F1
0.5079365
F1
0.8974359
Be aware that the choice of threshold in the classifier alters the confusion matrix
Receiver Operating Characteristic (ROC) curve:
library(ROCR)
# Note: pred.full and pred.uni are 0/1 class labels, so the resulting ROC curve
# has a single operating point; predicted probabilities trace the full curve
pr_full = prediction(pred.full, test$diagnosis_0_1)
perf_full = performance(pr_full, measure = "tpr", x.measure = "fpr")
pr_uni = prediction(pred.uni, test$diagnosis_0_1)
perf_uni = performance(pr_uni, measure = "tpr", x.measure = "fpr")
auc1 = performance(pr_full, measure = "auc")
auc1 = auc1@y.values[[1]]
auc2 = performance(pr_uni, measure = "auc")
auc2 = auc2@y.values[[1]]
print(paste("AUC full test", signif(auc1,digits=4)))
[1] "AUC full test 0.917"
[1] "AUC uni test 0.6521"
library(ROCit)
r1=rocit(predict(uni.model, test), test$diagnosis_0_1, method = "bin")
r3=rocit(predict(full.model, test), test$diagnosis_0_1, method = "bin")
plot(r1$TPR~r1$FPR, type = "l", xlab = "False Positive Rate", lwd = 2,
ylab = "Sensitivity", col= "gold4",cex.lab=2,cex.axis = 2)
grid()
legend("bottomright", c("Perfectly Separable",
"Univariate", "Full model", "Chance Line"), cex=2,
lwd = 2, col = c("darkgreen", "gold4",
"orange", "red"), bty = "n")

Supervised versus unsupervised models:
Supervised learning models predict something, given some other things.
Unsupervised learning models just create a model of the data.
Distinguish between seen data and unseen data.
Seen data is available for learning, namely for training of models, model selection, parameter tuning, and testing of models.
Unseen data is essentially unlimited. All data from the real world that is not available while learning takes place but is later available when the model is used. This can be data from the future, or data that was not collected or labelled with the seen data.
Machine learning works well if the nature of the seen data is similar to that of the unseen data.
Common practice is to split the seen data into training data and testing data (also named hold out set).
Training data is used for learning.
Testing data for mimicking a scenario of unseen data.
To recalibrate, adjust, or tune models we would need to evaluate repeatedly, and repeatedly testing on the test set is bad practice.
Instead, perform an additional split of the training data: remove a chunk from the training data and call it the validation set.
Statistical approaches generally use all available data and evaluate the quality of models using a selection criterion such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
Sometimes, instead of using a validation set, we can use K-fold cross validation.
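As a sketch, K-fold cross validation can be run with caret’s trainControl; this assumes the trainfull data frame created earlier:

```r
# 10-fold cross validation of the logistic model with caret
library(caret)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(101)
cv.model <- train(factor(diagnosis_0_1) ~ ., data = trainfull,
                  method = "glm", family = binomial,
                  trControl = ctrl)
cv.model$results   # accuracy and kappa averaged over the 10 folds
```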
standardization of the data: subtraction of the mean of each feature and division by the standard deviation of the feature.
sample mean and sample standard deviation
\[\begin{equation} \label{eq:stats-mean-var} \overline{x}_i = \frac{1}{n} \sum_{j=1}^n x_i^{(j)}, \qquad s^2_i = \frac{1}{n} \sum_{j=1}^n (x_i^{(j)} - \overline{x}_i)^2. \end{equation}\]
\[\begin{equation} \label{eq:ref-stand-z} z^{(j)}_i = \frac{x^{(j)}_i - \overline{x}_i}{s_i} \qquad \text{for} \qquad j=1,\ldots,n. \end{equation}\]
For feature \(i\), \(z_i^{(1)}, \ldots,z_i^{(n)}\), has a sample mean of exactly \(0\) and a sample standard deviation of exactly \(1\).
Such standardization is useful as it places the dynamic range of the model inputs on a uniform scale and thus improves the numerical stability of algorithms.
We will discuss how such standardization can also help optimization performance.
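In R, scale() performs exactly this standardization; a small sketch with two hypothetical features (note that scale() uses the \(n-1\) denominator for the standard deviation, a minor difference from the \(1/n\) convention above):

```r
# Standardization: subtract each feature's mean, divide by its sd
X <- cbind(x1 = c(1, 2, 3, 4), x2 = c(10, 20, 30, 40))
Z <- scale(X)        # center = TRUE, scale = TRUE by default
colMeans(Z)          # ~0 for each feature
apply(Z, 2, sd)      # 1 for each feature
```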
Learning activity involves optimization either explicitly or implicitly.
This is because learning is the process of seeking model parameters that are best for some given task.
Optimize some performance measure that quantifies how good the model at hand performs.
Example, use mean square error criterion for regression problems
In other cases, a loss function is used such that minimization of the loss function is a proxy for minimization of the actual performance measures that are of interest.
Example, for classification problems we aim to get the most accurate classifier but optimization is typically not carried out directly on the accuracy measure but rather on a loss function such as the mean square error, or cross entropy
The most common form of machine learning is supervised learning:
Data is comprised of the features (independent variables) \(X\) and the labels (dependent variables, response variables) \(Y\).
Regression problem when the variable to be predicted \(Y\) is a continuous or numerical variable
Classification problem, in cases where \(Y\) comes from some finite set of label values.
Start simple with an example: Housing data
univariate example: \(x\) is the average number of rooms per dwelling and \(y\) is the median value of owner-occupied homes in thousands of dollars.
A regression model \(y = f_\theta(x)\) attempts to predict the median house price as a function of the average number of rooms.
\[\begin{equation} \label{eq:simple-linear-reg} y = \beta_0 + \beta_1 x + \epsilon, \end{equation}\]
data("Boston", package = "MASS")
ind <- which(Boston$medv==50)
Boston.s <- Boston[-ind,]
train.data <- Boston.s
model1 <- lm(medv~rm,data=train.data)
p1 <- ggplot(train.data, aes(rm, medv)) +
geom_point(size=2) + stat_smooth(method = "lm", col = "red") +
xlab('Average number of rooms per dwelling (rm)') + ylab('House prices in $1000 (medv)')+
theme(legend.position = 'bottom') + theme_bw()
p1 + theme(axis.text = element_text(size = 10))+ theme(axis.title = element_text(size = 10))

A quadratic model:

\[\begin{equation} \label{eq:poly-2-reg} y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon. \end{equation}\]
model.lm <- lm(medv~lstat,data=train.data)
model.quad <- lm(medv~lstat+I(lstat^2),data=train.data)
p2 <- ggplot(train.data, aes(lstat, medv) ) +
geom_point(size=2) +
stat_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE))+
stat_smooth(method = lm, formula = y ~ x,col='red')+
xlab('Lower status of the population in % (lstat)') + ylab('House prices in $1000 (medv)')+
theme(legend.position = 'bottom') + theme_bw()
p2 + theme(axis.text = element_text(size = 10))+ theme(axis.title = element_text(size = 10))

Consider a training dataset \(\mathcal{D}=\{(x^{(1)},y^{(1)}),\ldots,(x^{(n)},y^{(n)})\}\) composed of a collection of \(n\) samples.
Design matrix \(X\) for the features, and the corresponding output response vector \(y\), via,
\[\begin{equation} \label{eq:design} X=\begin{bmatrix} \vert & \vert & &\vert \\ 1 &x_{(1)} &\dots &x_{(p)} \\ \vert & \vert & & \vert \end{bmatrix} \quad \textrm{with} \quad x_{(i)}=\begin{bmatrix} x_i^{(1)} \\ \vdots \\ x_i^{(n)} \end{bmatrix}, \quad \textrm{and} \quad y= \begin{bmatrix} y^{(1)} \\ %y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} . \end{equation}\]
Linear model for all the samples of the training set via \[ y=X\theta+ \epsilon, \] with \(\epsilon=(\epsilon_1,\dots,\epsilon_n)\)
Define the predicted output vector of the model for the input training data via, \[ \hat{y}=X\widehat{\theta}, \qquad \text{where} \qquad \hat{y} = \begin{bmatrix} \hat{y}^{(1)} \\ %y^{(2)} \\ \vdots \\ \hat{y}^{(n)} \end{bmatrix}. %(\hat{y}^{(1)},\ldots,\hat{y}^{(n)}). \]
A suitable learned value for \(\widehat{\theta}\) will yield \(\hat{y}\approx y\).
This closeness is captured via a loss function: \[\begin{equation} \label{eq:general-additive-loss} C(\theta \,; \, \mathcal D)=\frac{1}{n}\sum_{i=1}^nC_i(\theta) \end{equation}\] where \(C_i(\theta):=C_i(\theta \, ; \, y^{(i)}, \hat{y}^{(i)})\) is the loss for the \(i\)-th data sample. Specifically \(\widehat{\theta}\) is typically chosen so that the loss function is minimal at the point \(\theta = \widehat{\theta}\).
For the linear model: square loss function also called quadratic loss where the loss for each data sample is \[\begin{equation} \label{eq:quadratic-loss-one-obs} C_i(\theta)=({y}^{(i)} - \hat{y}^{(i)})^2. \end{equation}\]
Loss for the entire training data can be represented in terms of the \(L_2\) norm \(\|\cdot \|\) of the corresponding error vector, \[ C(\theta;\mathcal D) = \frac{1}{n}\sum_{i=1}^n({y}^{(i)} - \hat{y}^{(i)})^2 = \frac{1}{n}\|{y}-\hat{y}\|^2 \]
Optimize:
\[\begin{equation} \label{eq:least-squares-opt} \hat{\theta} =\textrm{argmin}_{\theta\in \Re^d} \frac{1}{n}{\| y - X\theta\|^2} = \textrm{argmin}_{\theta\in \Re^d}{\|y - X\theta\|}^2. \end{equation}\]
The least squares solution can be easily derived by first computing the gradient of \(\| y - X\theta\|^2\) with respect to \(\theta\) \[\begin{equation} \label{eq:gradLS} \begin{aligned} \frac{\partial \| y - X\theta\|^2}{\partial \theta}&=-2X^\top y+2X^\top X\theta. \end{aligned} \end{equation}\]
Setting the gradient to 0 we get
\[\begin{equation} \label{eq:normal} X^\top X\theta=X^\top y, \end{equation}\]
Unique solution when the \(d\times d\) matrix \(X^\top X\), also known as the Gram matrix of \(X\), is invertible.
In this case: \(\hat{\theta}=(X^\top X)^{-1}X^\top y\), or, by setting \(X^{\dagger} = (X^\top X)^{-1}X^\top\), we have,
\[\begin{equation} \label{eq:closedform} \hat{\theta}=X^{\dagger} y, \end{equation}\]
where \(X^{\dagger}\) is Moore-Penrose pseudo-inverse of \(X\).
Using the singular value decomposition \(X = U \Delta V^\top\), \[\begin{equation} \label{eq:svd-based-mppi} X^{\dagger} = V \Delta^+U^\top, \end{equation}\] where \(\Delta^+\) contains the reciprocals of the non-zero (diagonal) elements of \(\Delta\), and has \(0\) values elsewhere.
This SVD-based representation holds both when \(X^\top X\) is invertible and when it is singular.
For high dimensional data (\(p \gg n\)) the design matrix \(X\) is never of full column rank, so \(X^\top X\) is singular.
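The closed form above can be checked on the Boston example; a sketch (assuming the Boston data loaded as earlier) comparing the normal-equation solution with lm():

```r
# Least squares via the normal equations: X'X theta = X'y
data("Boston", package = "MASS")
X <- cbind(1, Boston$rm)                    # design matrix with intercept column
y <- Boston$medv
theta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
theta_hat
coef(lm(medv ~ rm, data = Boston))          # same estimates

# When X'X is singular (e.g. p >> n), the Moore-Penrose pseudo-inverse
# from MASS still yields a solution:
# theta_hat <- MASS::ginv(X) %*% y
```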
The MNIST dataset of handwritten digits: there are \(60,000\) digits called the training set and \(10,000\) additional digits called the test set.
The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.
First, we need to download and decompress the following files. Their format is described on the website.
We use R to extract the information from these files.
file_training_set_image <- "train-images.idx3-ubyte"
file_training_set_label <- "train-labels.idx1-ubyte"
file_test_set_image <- "t10k-images.idx3-ubyte"
file_test_set_label <- "t10k-labels.idx1-ubyte"
extract_images <- function(file, nbimages = NULL) {
if (is.null(nbimages)) {
# We extract all images
nbimages <- as.numeric(paste("0x", paste(readBin(file, "raw", n = 8)[5:8], collapse = ""), sep = ""))
}
nbrows <- as.numeric(paste("0x", paste(readBin(file, "raw", n = 12)[9:12], collapse = ""), sep = ""))
nbcols <- as.numeric(paste("0x", paste(readBin(file, "raw", n = 16)[13:16], collapse = ""), sep = ""))
raw <- readBin(file, "raw", n = nbimages * nbrows * nbcols + 16)[-(1:16)]
return(array(as.numeric(paste("0x", raw, sep = "")), dim = c(nbcols, nbrows, nbimages)))
}
extract_labels <- function(file) {
nbitem <- as.numeric(paste("0x", paste(readBin(file, "raw", n = 8)[5:8], collapse = ""), sep = ""))
raw <- readBin(file, "raw", n = nbitem + 8)[-(1:8)]
return(as.numeric(paste("0x", raw, sep = "")))
}
images_training_set <- extract_images(file_training_set_image, 60000)
images_test_set <- extract_images(file_test_set_image, 10000)
labels_training_set <- extract_labels(file_training_set_label)
labels_test_set <- extract_labels(file_test_set_label)
labels_training_set[1:10]
 [1] 5 0 4 1 9 2 1 3 1 4
# par(ask = TRUE)
par(mfrow = c(2, 2))
for (i in 1:4) image(as.matrix(rev(as.data.frame(images_training_set[, , i]))), col = gray((255:0)/256))

Each training input \(\mathbf{x}\) is a \(28\times 28 = 784\)-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image.
We’ll denote the corresponding desired output by \(\mathbf{y} = y(\mathbf{x})\), where \(\mathbf{y}\) is a 10-dimensional vector. For example, if a particular training image, \(\mathbf{x}\), depicts a 6, then \(y(\mathbf{x})=(0,0,0,0,0,0,1,0,0,0)^\top\) is the desired output from the network.
We consider each image as a vector and obtain a different least squares estimator for each type of digit. For digit \(\ell\in\{0,\ldots,9\}\) we collect all the training data vectors. For each \(i\) with \(y_i=\ell\) we set \(y_i = +1\), and for all other \(i\) with \(y_i\neq \ell\) we set \(y_i = -1\).
This labels our data for the classification task “this digit” vs. “not this digit”. Call this vector of \(-1\) and \(+1\) values \(y^{(\ell)}\) for every digit \(\ell\).
We then compute, \[\beta^{(\ell)}=(X^\top X)^{-1}X^\top y^{(\ell)}\ \ \ \ \ \textrm{for} \ \ \ \ell=0,\ldots,9\] where \(X\) is the \(60,000\times 784\) design matrix associated with the \(60,000\) images.
Now for every image \(i\), the inner product \(\beta^{(\ell)\top}\mathbf{x}_i\) yields an estimate of how likely this image is to be of the digit \(\ell\): a very high value indicates a high likelihood and a low value a low likelihood. We then classify an arbitrary image \(\tilde{\mathbf x}\) by
\[\begin{eqnarray} \hat{y}(\tilde{\mathbf x}) = \textrm{argmax}_{\ell=0,\ldots,9}\ \beta^{(\ell)\top}\tilde{\mathbf x} \ \ \ \ \ \ \ (1) \end{eqnarray}\]
Observe that during training, this classifier only requires calculating \((X^\top X)^{-1}X^\top\) once. It then only needs to store the 10 vectors \(\beta^{(0)},\ldots,\beta^{(9)}\). Based on these 10 vectors, the decision rule (1) is very simple to execute.
Create Design matrix and outcome responses
vectorized_result <- function(j) {
e <- as.matrix(rep(0, 10))
e[j + 1] <- 1
return(e)
}
Xtrain <- matrix(0,nrow=60000,ncol=784)
Ytrain <- matrix(0,nrow=60000,ncol=10)
for (i in 1:60000) {
Xtrain[i,] <- as.vector(images_training_set[,,i]) / 256
Ytrain[i,] <- t(vectorized_result(labels_training_set[i]))
}
Ytrain[which(Ytrain==0)] <- -1
Xtest <- matrix(0,nrow=10000,ncol=784)
Ytest <- matrix(0,nrow=10000,ncol=10)
for (i in 1:10000) {
Xtest[i,] <- as.vector(images_test_set[,,i]) / 256
Ytest[i,] <- t(vectorized_result(labels_test_set[i]))
}
pred.model <- function(x,Xtest=Xtest){sum(x*Xtest)}
linear.class <- function(model,Xtest){
res <- apply(model,MARGIN=2,FUN=pred.model,Xtest=Xtest)
pred <- which.max(res)-1
return(pred)
}

result <- 0
res <- rep(0,10000)
for (i in 1:10000){
Xtest1 <- (as.vector(images_test_set[,,i]) / 256)
resi <- linear.class(model,Xtest=Xtest1)
res[i] <- resi
if(abs(resi-labels_test_set[i])<0.5) result <- result +1
}
accuracy <- result/10000
accuracy
[1] 0.8534
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 942 0 17 4 0 20 17 5 17 18
1 0 1107 56 15 23 17 9 38 54 10
2 2 2 809 26 6 2 10 18 9 2
3 2 2 28 887 3 84 0 8 32 15
4 1 1 16 2 872 19 21 20 27 72
5 7 1 0 14 5 624 20 0 42 1
6 15 5 42 9 10 22 872 1 15 1
7 2 2 21 21 2 13 0 877 12 76
8 7 15 39 21 13 69 9 3 743 13
9 2 0 4 11 48 22 0 58 23 801
We have used the one vs. all (also known as one vs. rest) strategy, applying binary classification to a multi-class classification problem (see later).
Second option: the one vs. one strategy. We have \(K(K-1)/2\) binary classifiers denoted \({f}_{{\theta}_{i,j}}(x)\) for all \(i,j = 1,\ldots,K\) such that \(i \neq j\). Here the \((i,j)\)th classifier discriminates between the label being of index \(i\) or index \(j\).
The one vs. all or one vs. one strategies carry out prediction via,
\[ \widehat{\cal Y} = \begin{cases} \textrm{argmax}_{i=1,\ldots,K} {f}_{{\theta}_i}(x) & \qquad \text{in case of one vs. all}, \\ \textrm{argmax}_{i=1,\ldots,K} \sum_{j \neq i} \text{sign}\big( {f}_{{\theta}_{i,j}}(x) \big)& \qquad \text{in case of one vs. one}. \\ \end{cases} \]
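A toy sketch of the two decision rules with \(K=3\) classes and made-up scores:

```r
# one vs. all: pick the class whose score function is largest
scores <- c(0.2, 1.4, -0.3)        # hypothetical f_{theta_i}(x), i = 1,2,3
which.max(scores)                  # predicted class: 2

# one vs. one: each pairwise classifier casts a signed vote
pairwise <- matrix(0, 3, 3)        # pairwise[i, j] = f_{theta_{i,j}}(x)
pairwise[2, 1] <- 0.5;  pairwise[1, 2] <- -0.5   # class 2 beats class 1
pairwise[2, 3] <- 0.8;  pairwise[3, 2] <- -0.8   # class 2 beats class 3
pairwise[1, 3] <- 0.3;  pairwise[3, 1] <- -0.3   # class 1 beats class 3
votes <- rowSums(sign(pairwise))   # sum of signed votes per class
which.max(votes)                   # predicted class: 2
```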
\[\begin{equation} \label{poly} y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_k x^k+ \epsilon \end{equation}\]
\(\mathcal{M}_k\) means polynomial model of order \(k\)
Hence \(\mathcal{M}_0\) is the constant model, \(\mathcal{M}_1\) is the simple linear model, \(\mathcal{M}_2\) is the quadratic model, and so on.
Model complexity corresponds to the degree of the polynomial model.
\(\mathcal{D}_{\text{train}}\) of size \(n_{\text{train}}=10\).
load("data-poly.Rdata")
y <- mydata$y
x <- mydata$x
model0 <- glm(y~1)
model1 <- glm(y~poly(x,1))
model2 <- glm(y~poly(x,2))
model3 <- glm(y~poly(x,3))
model4 <- glm(y~poly(x,4))
model5 <- glm(y~poly(x,5))
model6 <- glm(y~poly(x,6))
model7 <- glm(y~poly(x,7))
model8 <- glm(y~poly(x,8))
model9 <- glm(y~poly(x,9))
# Much better
model.list <- vector("list", length = 10)
model.list[[1]] <- glm(y~1)
for (i in 1:9){
formula <- as.formula(paste("y ~ poly(x,",i,")",sep=""))
model.list[[i+1]] <- glm(formula)
}
my.formula <- y ~ 1
p0 <- ggplot(mydata, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE,
formula = my.formula,
colour = "red")+
labs(title=element_text("Polynomial fit with k=0"))+
theme(legend.position = "none") + theme_bw()
my.formula <- y ~ poly(x, 1, raw = TRUE)
p1 <- ggplot(mydata, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE,
formula = my.formula,
colour = "red")+
labs(title=element_text("Polynomial fit with k=1"))+
theme(legend.position = "none") + theme_bw()
my.formula <- y ~ poly(x, 3, raw = TRUE)
p3 <- ggplot(mydata, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE,
formula = my.formula,
colour = "red")+
labs(title=element_text("Polynomial fit with k=3"))+
theme(legend.position = "none")+ theme_bw()
my.formula <- y ~ poly(x, 9, raw = TRUE)
p9 <- ggplot(mydata,aes(x, y=y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE,
formula = my.formula,
colour = "red")+
labs(title=element_text("Polynomial fit with k=9"))+
theme(legend.position = "none") + theme_bw()
library(ggpubr)
ggarrange(p0,p1,p3,p9,ncol = 2, nrow = 2)

It is obvious that as \(k\) increases, the training fit improves.
Evaluate the performance of the model using a performance measure \({\cal P}\)
\[\begin{equation} \label{eq:e-train-train} {E}_{\text{train}} = \frac{1}{n_{\text{train}}}\sum_{(x,y) \in \mathcal{D}_{\text{train}}} {\cal P} \big(\widehat{y}(x \, ; \, \mathcal{D}_{\text{train}}),\, y\big), \end{equation}\]
RMSE <- function(model,dataXY){
  pred <- predict.glm(model,newdata=data.frame(x=dataXY$X))
  MSE <- mean((pred-dataXY$Y)**2)
  return(MSE) # note: despite the name, this returns the mean squared error (no square root)
}
data.train <- data.frame(X=x,Y=y)
RMSE.result <- matrix(NA,nrow=10,ncol=2)
for(i in 1:10){
model <- model.list[[i]]
RMSE.result[i,] <- c(i-1,RMSE(model,data.train))
}
data.result <- as.data.frame(RMSE.result)
colnames(data.result) <- c("M","Train")
mydata <- data.frame(Model=data.result$M,Performance=data.result$Train,Group=rep("Train",each=10),color=rep("red",each=10))
linetype = c('solid', 'solid')
LegendTitle = ""
p <- ggplot(mydata, aes(Model,Performance, group = Group,color=color))+
geom_line(aes(group = Group,color=color))+#,"red")) +
scale_color_identity(labels = c(blue = "blue",red = "red"))+
#scale_linetype_manual(name = LegendTitle, values = linetype) +
scale_x_continuous(name="Model (k)",breaks=0:9,labels=0:9)+
scale_y_continuous(name="Mean Square Error")+
theme_bw() +
annotate(geom="text", x=7.5, y=1, label="E_train ",color="red")+
annotate("segment", x = 6.6, xend = 7, y = 1, yend = 1,
colour = "red")
p

It is obvious that as \(k\) increases, the training fit improves.
Hypothetical example: we know the underlying process with probability law \(P(x,y)\) used for purposes of simulation of synthetic data
We may sample as many \((x^\star,y^\star)\) pairs as we wish, to obtain the performance measure for unseen data \(E_{\text{unseen}}\).
\(10,000\) repetitions for each \(k\), each time with the fixed model based on our single available dataset.
set.seed(1123)
testx <- seq(0,1,length=10000)
testy <- sin(2*pi*testx)-cos(2*pi*testx)+rnorm(10000,0,0.2)
data.test <- data.frame(X=testx,Y=testy)
RMSE.result <- matrix(NA,nrow=10,ncol=3)
for(i in 1:10){
model <- model.list[[i]]
RMSE.result[i,] <- c(i-1,RMSE(model,data.train),RMSE(model,data.test))
}
colnames(RMSE.result) <- c("M","Train","Test")
data.result <- as.data.frame(RMSE.result)
mydata <- data.frame(Model=rep(data.result$M,2),Performance=c(data.result$Train,data.result$Test),Group=rep(c("Train","Production"),each=10),color=rep(c("red","black"),each=10))
linetype = c('solid', 'solid')
LegendTitle = ""
p <- ggplot(mydata, aes(Model,Performance, group = Group,color=color))+
geom_line(aes(group = Group,color=color))+#,"red")) +
scale_color_identity(labels = c(blue = "blue",red = "red"))+
scale_x_continuous(name="Model (k)",breaks=0:9,labels=0:9)+
scale_y_continuous(name="Mean Square Error")+
theme_bw() +
annotate(geom="text", x=7.5, y=1, label="E_train", parse=TRUE
,color="red")+
annotate("segment", x = 6.6, xend = 7, y = 1, yend = 1,
colour = "red")+
annotate(geom="text", x=7.6, y=0.95, label="E_unseen", parse=TRUE
,color="black")+
annotate("segment", x = 6.6, xend = 7, y = 0.95, yend = 0.95,
colour = "black")
p

It is clear that when \(k=9\) or \(k=8\) there is over-fitting and when \(k=0,1,2\) there is under-fitting.
In practice such plots cannot be produced because we do not know \(P(x,y)\). Instead, one can resort to estimates based on cross validation to obtain curves similar to the black curve.
Behaviour of expected generalization performance and expected training performance
bias and variance tradeoff.
Special case of the square error performance function \({\cal P}(\hat{y}, y) = (\hat{y} - y)^2\) and a specifically assumed underlying random reality
\[\begin{equation} \label{eq:process-real} y= f(x) + \epsilon, \quad \text{with} \quad \textrm{E}[\epsilon]=0, \quad \text{and }\epsilon\text{ is independent of }x. \end{equation}\]
\(x\) a vector of features and \(y\) a scalar real valued label.
For some unseen feature vector \(x^\star\), the predictor trained on data \(\mathcal D\) is \(\widehat{y}(x^\star \, ;\, \mathcal D)\), which we also denote via \(\hat{f}(x^\star \, ; \, \mathcal D)\)
The expected generalization performance
\[\begin{equation} \label{eq:e-unseen-on-way} \widetilde{E}_{\text{unseen}} ={\mathbb E}_{\mathcal D, x^\star, \epsilon} \Big[ \big(\hat{f}(x^\star;\mathcal D)-(f(x^\star)+\epsilon) \big)^2\Big]. \end{equation}\]
\[\begin{equation} \label{eq:biasVarTrad} \widetilde{E}_{\text{unseen}}=\underbrace{\left({\mathbb E}[\hat{f}(x^\star\, ; \, \mathcal D)]-{\mathbb E}[f(x^\star)]\right)^{2}}_{\text{Bias squared of~} \hat{f}(\cdot)} ~+~~ \underbrace{\text{Var} \big(\hat{f}(x^\star\, ; \, \mathcal D)\big)}_{\text{Variance of}~\hat{f}(\cdot)} ~~~+~ \underbrace{{\mathbb E}[\epsilon^2]}_{\text{Inherent Noise}}. \end{equation}\]
The model bias is a measure of how a typical model \(\hat{f}(\cdot \, ; \, \mathcal D)\) misspecifies the correct relationship \({f}(\cdot)\).
The model variance is a measure of the variability of the model class \(\hat{f}(\cdot\, ; \, \mathcal D)\).
Model classes with high model variance are often overfit (to the training data) and do not generalize (to unseen data) well.
For example in a classification setting you may compare the accuracy obtained on the training set to that obtained on a validation set. If there is a high discrepancy where the training accuracy is much higher than the validation accuracy, then there is probably a variance problem indicating that the model is overfitting.
One way to control model variance is to induce or force the model parameters to remain within some confined subset of the parameter space.
This is called regularization.
A decrease in model variance may imply an increase in model bias.
A common way to keep model parameters at bay is to augment the optimization objective \(\min_{\theta} C(\theta \, ; \,\mathcal D)\) with an additional term \(R_\lambda(\theta)\).
The revised objective is \[\begin{equation} \label{eq:regularization} \min_{\theta} ~C(\theta \, ; \, \mathcal D) + R_\lambda(\theta). \end{equation}\]
\(R_\lambda(\theta)\) depends on a regularization parameter \(\lambda\)
This hyper-parameter allows us to optimize the bias and variance tradeoff.
A common general regularization technique called elastic net has regularization parameter \(\lambda = (\lambda_1, \lambda_2)\) and,
\[\begin{equation} \label{eq:e-net} R_\lambda(\theta) = \lambda_1 \| \theta \|_1 + \lambda_2 \| \theta\| ^2 \quad \text{with} \quad \|\theta\|_1 = \sum_{i=1}^{d} | \theta_i | ~~ \text{and} ~~ \|\theta\|^2 = \sum_{i=1}^{d} \theta_i^2, \end{equation}\]
With \(\lambda_1 = \lambda_2 = 0\) the original objective is unmodified.
In contrast, as \(\lambda_1 \to \infty\) or \(\lambda_2 \to \infty\) the estimates \(\theta_i \to 0\) and any information in the data \(\mathcal D\) is fully ignored.
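A direct sketch of the penalty with hypothetical \(\theta\) and \(\lambda\) values (note that glmnet parameterizes the same family through alpha: alpha = 1 for the LASSO, alpha = 0 for ridge):

```r
# Elastic net penalty: lambda1 * ||theta||_1 + lambda2 * ||theta||^2
enet_penalty <- function(theta, lambda1, lambda2) {
  lambda1 * sum(abs(theta)) + lambda2 * sum(theta^2)
}
theta <- c(1.5, -0.2, 0, 3.0)
enet_penalty(theta, lambda1 = 0, lambda2 = 0)   # 0: objective unmodified
enet_penalty(theta, lambda1 = 1, lambda2 = 0)   # LASSO part: 4.7
enet_penalty(theta, lambda1 = 0, lambda2 = 1)   # ridge part: 11.29
```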
Particular cases of elastic net are the classic ridge regression: \(\lambda_1 = 0\)
For \(\lambda_2 = 0\) we have the LASSO (least absolute shrinkage and selection operator)
With LASSO, the penalty \(|| \theta ||_1\) allows the algorithm to remove variables from the model
Hence LASSO is very useful as a model selection technique.
The regularization parameter \(\lambda\) is a hyper-parameter that one would like to calibrate during learning.
load("Breast_cancer.RData")
library(glmnet)
data <- Breast_cancer_data
data <- data %>% mutate(diagnosis_0_1 = ifelse(diagnosis == "M", 1, 0))
library(caret)
set.seed(101)
training <- createDataPartition(data$diagnosis_0_1, p=0.8, list=FALSE)
train <- data[ training, ]
test <- data[ -training, ]
trainfull <- train[,-c(1,2)]
testfull <- data[ -training, -c(1,2)]
model.lasso <- glmnet(trainfull[,-31],trainfull[,31],family="binomial",alpha=1)
model.ridge <- glmnet(trainfull[,-31],trainfull[,31],family="binomial",alpha=0)
par(mfrow=c(2,2))
plot(model.ridge,xvar="lambda")
plot(model.lasso,xvar="lambda")

ind.lambda.r <- which.min(model.ridge$dev.ratio<0.8)
#model.ridge$beta[,ind.lambda.r]
ind.lambda.l <- which.min(model.lasso$dev.ratio<0.8)
#model.lasso$beta[,ind.lambda.l]
res <- cbind(model.ridge$beta[,ind.lambda.r],model.lasso$beta[,ind.lambda.l])
colnames(res) <- c("ridge","lasso")
res
                                ridge       lasso
radius_mean 7.594749e-02 0.00000000
texture_mean 5.975381e-02 0.00000000
perimeter_mean 1.089765e-02 0.00000000
area_mean 7.228806e-04 0.00000000
smoothness_mean 8.288914e+00 0.00000000
compactness_mean 1.841181e+00 0.00000000
concavity_mean 2.873464e+00 0.00000000
concave.points_mean 6.984665e+00 13.18732706
symmetry_mean 3.438559e+00 0.00000000
fractal_dimension_mean -1.535919e+01 0.00000000
radius_se 9.001367e-01 1.01502088
texture_se -1.638285e-02 0.00000000
perimeter_se 1.006807e-01 0.00000000
area_se 5.020945e-03 0.00000000
smoothness_se -1.333716e+01 0.00000000
compactness_se -4.755243e+00 0.00000000
concavity_se -5.804244e-01 0.00000000
concave.points_se 1.118136e+01 0.00000000
symmetry_se -3.620015e+00 0.00000000
fractal_dimension_se -5.602774e+01 0.00000000
radius_worst 6.523028e-02 0.38973127
texture_worst 4.997564e-02 0.09262034
perimeter_worst 9.048926e-03 0.00000000
area_worst 5.036682e-04 0.00000000
smoothness_worst 1.017587e+01 5.03527345
compactness_worst 9.392522e-01 0.00000000
concavity_worst 1.227223e+00 0.47407610
concave.points_worst 4.624727e+00 14.10470598
symmetry_worst 3.643479e+00 2.51759237
fractal_dimension_worst 4.145747e+00 0.00000000
WARNING: calibrating the model choice and the hyper-parameters while reusing the testing set for performance evaluation is bad practice!
For this reason it is common to further split the training data \(\mathcal D_{\text{train}}\)
\(\mathcal D_{\text{test}}\) is still reserved only for final performance evaluation before rolling out the model to production.
There are two main approaches: the train-validate split and K-fold cross validation.
The train-validate split approach is common in situations where the total number of datapoints \(n\) is large.
The K-fold cross validation approach is useful when data is limited.
The train-validate split implies that the original data with \(n\) samples is first split to training and testing. Then the training data is further split into two subsets: the first is (confusingly) again called the training set and the latter is the validation set.
\[ \mathcal D = \mathcal D_{\text{train}} \cup \mathcal D_{\text{validate}} \cup \mathcal D_{\text{test}}, \quad \text{where the unions are of disjoint sets.} \]
For example, if we use an 80-20 rule for both splits and divisibility holds, then \(n_{\text{test}} = 0.2 \times n\), \(n_{\text{train}} = 0.64 \times n\) and \(n_{\text{validation}} = 0.16 \times n\).
An alternative approach is K-fold cross validation.
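As a sketch of what K-fold cross validation does by hand (not run in the slides; it reuses `trainfull` from above, whose last column is the 0/1 outcome `diagnosis_0_1`):

```r
library(caret)
set.seed(101)
folds <- createFolds(trainfull[, 31], k = 5)    # 5 disjoint validation index sets
cv.error <- sapply(folds, function(idx) {
  # fit on 4 folds, evaluate on the held-out fold
  fit <- glm(diagnosis_0_1 ~ ., data = trainfull[-idx, ], family = binomial)
  phat <- predict(fit, newdata = trainfull[idx, ], type = "response")
  mean(ifelse(phat > 0.5, 1, 0) != trainfull[idx, 31])  # fold misclassification rate
})
mean(cv.error)  # cross-validated estimate of the error rate
```

In practice `cv.glmnet`, used below, performs this loop internally over a grid of \(\lambda\) values.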
## Example: Lasso and ridge regression
### A practice on Breast Cancer data
model.lasso <- cv.glmnet(as.matrix(trainfull[,-31]),trainfull[,31],family="binomial",alpha=1,type.measure = "class")
model.ridge <- cv.glmnet(as.matrix(trainfull[,-31]),trainfull[,31],family="binomial",alpha=0,type.measure = "class")
plot(model.lasso)
Best model on test data:
res.l <- predict(model.lasso,s = "lambda.min",newx=as.matrix(testfull[,-31]),type="response")
classifier.l <- ifelse(res.l>0.5,"1","0")
table(classifier.l,testfull[,31])
classifier.l 0 1
0 73 2
1 0 38
result.lasso <- caret::confusionMatrix(factor(classifier.l),factor(testfull[,31]),positive="1")
result.lasso$table
          Reference
Prediction  0  1
         0 73  2
         1  0 38
Accuracy
0.9823009
Sensitivity Specificity Precision Recall
0.95 1.00 1.00 0.95
The multinomial regression model (also called softmax regression, multinomial logistic regression, or multi-class logistic regression) generalizes the sigmoid (logistic) model from the binary response case to the case of \(K>2\) classes.
\(\mathcal D=\left\{(x^{(1)},y^{(1)}),\ldots,(x^{(n)},y^{(n)})\right\}\), where each label \(y^{(j)}\) is one of the \(K\) class values \(\{1,\ldots,K\}\).
Aim: predict class probability vectors, or if used for classification, to predict a class in a multi-class setting.
The predicted response is the probability vector \[\begin{equation} \label{eq:bold-phi-phi-x} \boldsymbol{\phi}(x)=\big(\phi_1(x), \ldots, \phi_K(x) \big), \end{equation}\]
where \[\begin{equation} \label{eq:phi-mult-case} \phi_k(x) = P(Y = k ~|~ X=x) \quad \text{for} \quad k=1,\ldots,K. \end{equation}\]
and further, \[\begin{equation} \label{eq:stat-param2} \phi_K(x)=1-\sum_{j=1}^{K-1}\phi_j(x). \end{equation}\]
\[\begin{equation} \label{eq:star-rep-min-mult-model-theta} \theta=(b_1,\ldots,b_{K-1},w_{(1)},\ldots,w_{(K-1)})\in \underbrace{\Re\times\ldots \times\Re}_{K-1~\text{times}} \times \underbrace{\Re^p\times\ldots\times\Re^p}_{K-1~\text{times}} := \Theta, \end{equation}\]
\[\begin{equation} \label{eq:softmax-in-mult} S_{\text{softmax}}(z) = \frac{1}{\sum_{i=1}^{K} e^{z_i}} \begin{bmatrix} e^{z_1} \\ \vdots\\ e^{z_{K}}\\ \end{bmatrix}. \end{equation}\]
For any vector \(z \in \Re^K\), the result of \(S_{\text{softmax}}(z)\) is a probability vector
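As a quick sanity check, here is a minimal softmax implementation (illustrative, not from the slides):

```r
# Minimal softmax: maps any z in R^K to a probability vector
softmax <- function(z) {
  e <- exp(z - max(z))  # subtracting max(z) avoids overflow; result is unchanged
  e / sum(e)
}
p <- softmax(c(1, 2, 3))
p       # entries are positive and increasing in z
sum(p)  # 1: the output is a probability vector
```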
Now apply the softmax function to a vector of length \(K\) that has \(b_k + w_{(k)}^\top x\) at the \(k\)th coordinate.
\[\begin{equation} \label{eq:softeq} \hat{y}=\underbrace{S_{\text{softmax}} \big(\overbrace{b+W x}^{z\in \Re^K}\big)}_{a\in \Re^K}, \end{equation}\]
where \(\hat{y}\) is the prediction vector of \(\boldsymbol{\phi}(x)\) with
\[\begin{equation} \label{eq:mult-params-vec-mat} b = \left[ \begin{matrix} b_1\\ \vdots\\ b_K \end{matrix} \right], \qquad \text{and} \qquad {W}= \begin{bmatrix} ~\text{------} & w_{(1)}^\top & \text{------} \\[5pt] ~\text{------} & w_{(2)}^\top & \text{------} \\[5pt] & \vdots & \\[5pt] ~\text{------} & w_{(K)}^\top & \text{------}\\ \end{bmatrix}. \end{equation}\]
- Classifier: the output \(\hat{y}\) is a probability vector, so we classify by taking the most probable class:
\[\begin{equation} \label{eq:argmax-mp} \widehat{\cal Y}=\underset{k \in\{1, \ldots, K\}}{\operatorname{argmax}} ~\hat{y}_k. \end{equation}\]
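In R this row-wise argmax can be computed with `max.col` (the same function used later for the MNIST predictions); the probabilities here are made up for illustration:

```r
# Rows = observations, columns = the K classes; entries are made-up probabilities
probs <- matrix(c(0.1, 0.7, 0.2,
                  0.6, 0.3, 0.1), nrow = 2, byrow = TRUE)
max.col(probs)  # c(2, 1): the most probable class for each observation
```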
library(glmnet)
# Xtrain, Ytrain, Xtest and the test labels are assumed loaded previously (a digits dataset)
Ytrain[which(Ytrain==-1)] <- 0
fit=cv.glmnet(Xtrain[1:1000,],Ytrain[1:1000,],family = "multinomial",type.measure = "class")
plot(fit)
library(IMIFA)
pred=as.numeric(predict(fit,Xtest,s=fit$lambda.min,type="class"))-1  # type="class" returns characters
par(mfrow=c(2,3))
for(i in 1:6){
show_digit(Xtest[i,])
title(sprintf("prediction = %s",pred[i]))
}
[1] 0.8313
library(nnet)
model <- multinom(Ytrain[1:20000,] ~ ., MaxNWts = 10000000, maxit = 50, data = as.data.frame(Xtrain[1:20000,]))
# weights: 7860 (7065 variable)
initial value 46051.701860
iter 10 value 9835.249814
iter 20 value 7772.741852
iter 30 value 6666.130513
iter 40 value 5630.991143
iter 50 value 4902.647609
final value 4902.647609
stopped after 50 iterations
results <- predict(model, newdata=Xtest, type='probs')
prediction <- max.col(results)
prediction <- prediction - 1
cl <- mean(prediction != labels_test_set)
print(paste('Accuracy', 1 - cl))
[1] "Accuracy 0.894"
Back to the case \(K=2\) with a sigmoid model
Consider a scenario with \(p=2\) features
load("versatile-boundaries.RData")
library(ggplot2)
ggplot(data) + geom_point(aes(x = x,y = y,color = as.character(label)), size = 1) +
theme_bw(base_size = 15) +
xlim(-2.5, 1.5) +
ylim(-2.5, 2.5) +
coord_fixed(ratio = 0.8) +
theme(axis.ticks=element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text=element_blank(),
axis.title=element_blank(),
      legend.position = "none")
train_data <- data
trainX <- train_data[, c(1, 2)]
trainY <- train_data[, 3]
trainY <- ifelse(trainY == 1, 0, 1)
data.glm <- data.frame(Y=trainY,X1=trainX[,1],X2=trainX[,2])
model.logistic <- glm(Y~X1+X2,data=data.glm,family=binomial)
## As a reminder we check the shape of our classifier
step <- 0.01
x_min <- min(trainX[, 1]) - 0.2
x_max <- max(trainX[, 1]) + 0.2
y_min <- min(trainX[, 2]) - 0.2
y_max <- max(trainX[, 2]) + 0.2
grid <- as.matrix(expand.grid(seq(x_min, x_max, by = step), seq(y_min,
y_max, by = step)))
data.grid <- data.frame(X1=grid[,1],X2=grid[,2])
Z <- predict(model.logistic,newdata=data.grid,type="response")
Z <- ifelse(Z <0.5, 1, 2)
g1 <- ggplot() +
geom_tile(aes(x = grid[, 1], y = grid[, 2], fill = as.character(Z)), alpha = 0.3, show.legend = F)+
geom_point(data = train_data, aes(x = x, y = y, color = as.character(trainY)),size = 1)+
theme_bw(base_size = 15) + coord_fixed(ratio = 0.8) +
labs(x=expression(x[1]),cex=2) +labs(y=expression(x[2]))+
theme(axis.ticks = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), legend.position = "none",axis.title=element_text(size=18,face="italic"))
g1
\[ P[Y=1|X_1=x_1,X_2=x_2]=\frac{1}{1+e^{-(b+w_1x_1+w_2x_2)}} \]
This model has a linear decision boundary. We can enrich it with quadratic features, classifying to one class when
\[\begin{equation} \label{eq:quad-decision-boundry} \tilde{b} + \tilde{w}_1 x_1 + \tilde{w}_2 x_2 + \tilde{w}_3 x_1^2 + \tilde{w}_4 x_2^2 + \tilde{w}_5 x_1 x_2 \ge 0. \end{equation}\]
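The code that fits `model.logistic.nl2` is not shown above; a minimal sketch, assuming it augments `data.glm` with the quadratic terms of the boundary equation, could be:

```r
# Hypothetical reconstruction: logistic model with quadratic decision boundary,
# reusing the data.glm data frame (columns Y, X1, X2) defined earlier.
model.logistic.nl2 <- glm(Y ~ X1 + X2 + I(X1^2) + I(X2^2) + I(X1*X2),
                          data = data.glm, family = binomial)
```

The `I()` wrapper tells `glm` to treat each squared and interaction term literally as a new feature.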
glm.fit: fitted probabilities numerically 0 or 1 occurred
Z <- predict(model.logistic.nl2,newdata=data.grid,type="response")
Z <- ifelse(Z <0.5, 1, 2)
g2 <- ggplot() +
geom_tile(aes(x = grid[, 1], y = grid[, 2], fill = as.character(Z)), alpha = 0.3, show.legend = F)+
geom_point(data = train_data, aes(x = x, y = y, color = as.character(trainY)),size = 1)+
theme_bw(base_size = 15) + coord_fixed(ratio = 0.8) +
theme(axis.ticks = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.text = element_blank(),
axis.title = element_blank(), legend.position = "none")+
theme(axis.title=element_text(size=18,face="italic"))
g2 + labs(x=expression(x[1]),cex=2) + labs(y=expression(x[2]),cex=2)
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit: fitted probabilities numerically 0 or 1 occurred
Z <- predict(model.logistic.nl3,newdata=data.grid,type="response")
Z <- ifelse(Z <0.5, 1, 2)
par(mfrow=c(1,2))
g2 <- ggplot() +
geom_tile(aes(x = grid[, 1], y = grid[, 2], fill = as.character(Z)), alpha = 0.3, show.legend = F)+
geom_point(data = train_data, aes(x = x, y = y, color = as.character(trainY)),size = 1)+
theme_bw(base_size = 15) + coord_fixed(ratio = 0.8) +
theme(axis.ticks = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.text = element_blank(),
axis.title = element_blank(), legend.position = "none")+
theme(axis.title=element_text(size=18,face="italic"))
g2+ labs(x=expression(x[1]),cex=2) +labs(y=expression(x[2]),cex=2)
Z <- predict(model.logistic.nl4,newdata=data.grid,type="response")
Z <- ifelse(Z <0.5, 1, 2)
g3 <- ggplot() +
geom_tile(aes(x = grid[, 1], y = grid[, 2], fill = as.character(Z)), alpha = 0.3, show.legend = F)+
geom_point(data = train_data, aes(x = x, y = y, color = as.character(trainY)),size = 1)+
theme_bw(base_size = 15) + coord_fixed(ratio = 0.8) +
labs(x=expression(x[1]),cex=2) +labs(y=expression(x[2]))+
theme(axis.ticks = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), legend.position = "none",axis.title=element_text(size=18,face="italic"))
g3 + labs(x=expression(x[1]),cex=2) + labs(y=expression(x[2]),cex=2)
Higher orders might gain more expressivity, but higher-order models yield obscure classification decision boundaries.
This certainly appears to be over-fitting.
The new set of features can become highly correlated, which causes difficulty in the inference of the parameters.
clr1 <- c(rgb(1,0,0,1),rgb(0,1,0,1),rgb(0,0,1,1),rgb(0.5,0.5,0.5,1))
clr2 <- c(rgb(1,0,0,.1),rgb(0,1,0,.1),rgb(0,0,1,.1),rgb(0.5,0.5,0.5,.5))
x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85,0.2,0.7,0.3,0.5,0.6,0.55,0.4)
y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3,0.7,0.5,0.3,0.7,0.6,0.65,0.7)
z <- c(1,2,2,2,1,0,0,1,0,0,1,2,1,3,3,3,3)
df <- data.frame(x,y,z)
plot(x,y,pch=19,cex=2,col=clr1[z+1],ylab=expression(x[2]),xlab=expression(x[1]),cex.lab=1.4)
#write.csv(df,file="synthetic_data_4_class.csv",row.names = FALSE)
library(nnet)
model.mult <- multinom(z~x+y,data=df)
# weights: 16 (9 variable)
initial value 23.567004
iter 10 value 8.943075
iter 20 value 7.332678
iter 30 value 7.299143
iter 40 value 7.245881
iter 50 value 7.232151
iter 60 value 7.180802
iter 70 value 7.165842
iter 80 value 7.135699
iter 90 value 7.116795
iter 100 value 7.096799
final value 7.096799
stopped after 100 iterations
pred_mult <- function(x,y){
res <- predict(model.mult,
newdata=data.frame(x=x,y=y),type="probs")
apply(res,MARGIN=1,which.max)
}
x_grid<-seq(0,1,length=601)
y_grid<-seq(0,1,length=601)
z_grid <- outer(x_grid,y_grid,FUN=pred_mult)
image(x_grid,y_grid,z_grid,col=clr2,ylab=expression(x[2]),xlab=expression(x[1]),cex.lab=1.4)
points(x,y,pch=19,cex=2,col=clr1[z+1])
legend("topleft", inset=0.02, legend=c("class 1", "class 2","class 3","class 4"),
       col=clr1[c(1,3,2,4)], cex=0.9, pch=c(19,19))
# weights: 28 (18 variable)
initial value 23.567004
iter 10 value 8.124558
iter 20 value 6.923017
iter 30 value 6.507148
iter 40 value 5.878525
iter 50 value 5.455237
iter 60 value 5.288689
iter 70 value 4.723312
iter 80 value 4.631339
iter 90 value 4.520595
iter 100 value 4.348960
final value 4.348960
stopped after 100 iterations
x_grid<-seq(0,1,length=601)
y_grid<-seq(0,1,length=601)
z_grid <- outer(x_grid,y_grid,FUN=pred_mult)
image(x_grid,y_grid,z_grid,col=clr2,ylab=expression(x[2]),xlab=expression(x[1]),cex.lab=1.4)
points(x,y,pch=19,cex=2,col=clr1[z+1])
legend("topleft", inset=0.02, legend=c("class 1", "class 2","class 3","class 4"),
       col=clr1[c(1,3,2,4)], cex=0.9, pch=c(19,19))
# weights: 184 (135 variable)
initial value 23.567004
iter 10 value 5.859693
iter 20 value 4.064641
iter 30 value 2.910536
iter 40 value 0.007991
final value 0.000063
converged
x_grid<-seq(0,1,length=601)
y_grid<-seq(0,1,length=601)
z_grid <- outer(x_grid,y_grid,FUN=pred_mult)
image(x_grid,y_grid,z_grid,col=clr2,ylab=expression(x[2]),xlab=expression(x[1]),cex.lab=1.4)
points(x,y,pch=19,cex=2,col=clr1[z+1])
legend("topleft", inset=0.02, legend=c("class 1", "class 2","class 3","class 4"),
       col=clr1[c(1,3,2,4)], cex=0.9, pch=c(19,19))
Learning activity involves optimization
Learning is the process of seeking model parameters that are ``best’’ for some given task.
A loss function is generally used as a proxy for the actual performance measure of interest.
Example: classification problems aim for the most accurate classifier. Optimization is typically not carried out directly on the accuracy measure, but rather on a loss function such as the mean squared error or the cross entropy.
For binary outcome, we use binary cross entropy
\[\begin{equation} \label{eq:bin-cross-enty} \text{CE}_{\text{binary}}({y},\hat{y}) =- \Big( y \log{\big(\hat{y} \big)}+(1-y )\log{\big(1-\hat{y} \big)}\Big), \end{equation}\]
This can be contrasted with the squared error loss, \[ C_i(\theta)=({y} - \hat{y})^2. \]
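To see concretely how the two losses penalize a confident wrong prediction differently (the function names `bce` and `sqloss` are ad-hoc):

```r
# Binary cross entropy and squared loss for a single observation
bce <- function(y, yhat) -(y * log(yhat) + (1 - y) * log(1 - yhat))
sqloss <- function(y, yhat) (y - yhat)^2
bce(1, 0.01)     # about 4.61: the penalty grows without bound as yhat -> 0
sqloss(1, 0.01)  # about 0.98: the penalty is bounded by 1
```

This unbounded penalty is one reason the cross entropy gives stronger gradients for badly mispredicted points.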
For a categorical outcome, we generally use the categorical cross entropy. For a label \(y \in \{1,\ldots,K\}\) and a probability vector \(\hat{y}\) of length \(K\):
\[\begin{equation} \label{eq:cat-cross-entr} \text{CE}_\text{categorical}(y, \hat{y}) = - \sum_{k=1}^K {{\mathbf 1}\{y = k\}} \log \hat{y}_k \end{equation}\]
Why does the choice of loss function matter? Consider the loss landscape of the sigmoid model:
- (a) Using the squared loss and (b) Using the binary cross entropy loss
Some Benefits of Cross Entropy Loss:
Optimization of the loss function through Gradient Descent algorithms
This gradient descent procedure executes over iterations indexed by \(t=0,1,2,\ldots\)
Works by taking small steps in the direction opposite to the gradient.
That is, it traverses downhill each time trying to descend in the steepest direction.
Steps are determined by a fixed \(\alpha > 0\) called the learning rate.
After some initialization \(\theta^{(0)} = \theta_{init}\), in each iteration \(t\), the next vector \(\theta^{(t+1)}\) is obtained via,
\[\begin{equation} \label{eq:gdls1} \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla C(\theta^{(t)}). \end{equation}\]
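The update rule above can be sketched for the sigmoid model on simulated data (everything here is illustrative: the data, the step count, and the learning rate \(\alpha = 0.1\) are arbitrary choices):

```r
# Gradient descent for logistic regression, minimizing the mean binary cross entropy
sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(1)
X <- cbind(1, matrix(rnorm(200), ncol = 2))      # n = 100: intercept + 2 features
y <- rbinom(100, 1, sigmoid(X %*% c(-1, 2, -2))) # labels from known parameters
theta <- rep(0, 3)                               # theta^(0): arbitrary initialization
alpha <- 0.1                                     # learning rate
for (t in 1:500) {
  grad  <- t(X) %*% (sigmoid(X %*% theta) - y) / nrow(X)  # gradient of the mean BCE
  theta <- theta - alpha * grad                  # theta^(t+1) = theta^(t) - alpha * grad
}
theta  # should move toward the generating parameters (-1, 2, -2)
```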
Convergence can depend on the choice of learning rate!
Standardization of Inputs can help the optimization
Illustration of gradient descent on Wisconsin breast cancer dataset
We split the observations into \(n_{\text{train}} = 456\) and \(n_{\text{test}}=113\).
We use a model with \(p=10\) features (\(d=11\)) and learn the parameters via gradient descent with some arbitrary initialization.